library(tidyverse)
library(DT)
library(lubridate)
library(ggthemes)
library(RColorBrewer)Exploratory Data Analysis
Introduction
- Exploratory data analysis involves the use of graphics, summaries and transformations to better understand the data and answer our questions.
- Exploratory methods, make few assumptions about the distributional shape of the data.
- Our results can help us refine our questions or develop new ones.
Here’s what Wickham says in “R for Data Science”: (page 82)
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
There are no “real” rules here, but the data I’m using today comes from a competition, so obviously winning was (the competition is closed) the goal. But to win, it was important to focus on two things
- Variation
- Covariation
Loading Libraries
Case Study
- This example is taken from Kaggle Shelter Outcomes https://www.kaggle.com/c/shelter-animal-outcomes
library(readr)
train <- read_csv("train.csv.gz")A .gz or a .zip is a compressed file. Compression saves space on our machines. The functions in readr can read those directly (some functions require decompression first)
- The dataset train contains the following variables, take a look:
datatable(train)more compactly
glimpse(train)Rows: 26,729
Columns: 10
$ AnimalID <chr> "A671945", "A656520", "A686464", "A683430", "A667013", …
$ Name <chr> "Hambone", "Emily", "Pearce", NA, NA, "Elsa", "Jimmy", …
$ DateTime <dttm> 2014-02-12 18:22:00, 2013-10-13 12:44:00, 2015-01-31 1…
$ OutcomeType <chr> "Return_to_owner", "Euthanasia", "Adoption", "Transfer"…
$ OutcomeSubtype <chr> NA, "Suffering", "Foster", "Partner", "Partner", "Partn…
$ AnimalType <chr> "Dog", "Cat", "Dog", "Cat", "Dog", "Dog", "Cat", "Cat",…
$ SexuponOutcome <chr> "Neutered Male", "Spayed Female", "Neutered Male", "Int…
$ AgeuponOutcome <chr> "1 year", "1 year", "2 years", "3 weeks", "2 years", "1…
$ Breed <chr> "Shetland Sheepdog Mix", "Domestic Shorthair Mix", "Pit…
$ Color <chr> "Brown/White", "Cream Tabby", "Blue/White", "Blue Cream…
Initial Steps
- The data needs to be prepared or processed for analysis (data cleaning)
- An important issue will always be missing values.
- na.omit( ) or complete.cases() can be used. (In more advanced courses, substitution of missing values instead of deletion)
missing values
train %>% summarise_all(funs(sum(is.na(.))))# A tibble: 1 × 10
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome
<int> <int> <int> <int> <int> <int> <int>
1 0 7691 0 0 13612 0 1
# ℹ 3 more variables: AgeuponOutcome <int>, Breed <int>, Color <int>
substitutions
train <- train %>%
mutate(NoName = is.na(Name))table(train$NoName)
FALSE TRUE
19038 7691
Make some decisions about the remainder. Real life data is always messy so make sure you use the useNA = option when tabling data:
table(train$OutcomeSubtype, useNA = "always")
Aggressive At Vet Barn Behavior
320 4 2 86
Court/Investigation Enroute Foster In Foster
6 8 1800 52
In Kennel In Surgery Medical Offsite
114 3 66 165
Partner Rabies Risk SCRP Suffering
7816 74 1599 1002
<NA>
13612
table(train$SexuponOutcome, useNA = "always")
Intact Female Intact Male Neutered Male Spayed Female Unknown
3511 3525 9779 8820 1093
<NA>
1
table(train$AgeuponOutcome, useNA = "always")
0 years 1 day 1 month 1 week 1 weeks 1 year 10 months 10 years
22 66 1281 146 171 3969 457 446
11 months 11 years 12 years 13 years 14 years 15 years 16 years 17 years
166 126 234 143 97 85 36 17
18 years 19 years 2 days 2 months 2 weeks 2 years 20 years 3 days
10 3 99 3397 529 3742 2 109
3 months 3 weeks 3 years 4 days 4 months 4 weeks 4 years 5 days
1277 659 1823 50 888 334 1071 24
5 months 5 weeks 5 years 6 days 6 months 6 years 7 months 7 years
652 11 992 50 588 670 288 531
8 months 8 years 9 months 9 years <NA>
402 536 224 288 18
train[, c(5,7,8)] <- train %>% select(OutcomeSubtype, SexuponOutcome, AgeuponOutcome) %>%
mutate_each(funs(replace(., is.na(.), "Unknown")))train %>% summarise_all(funs(sum(is.na(.))))# A tibble: 1 × 11
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome
<int> <int> <int> <int> <int> <int> <int>
1 0 7691 0 0 0 0 0
# ℹ 4 more variables: AgeuponOutcome <int>, Breed <int>, Color <int>,
# NoName <int>
Fixing that AgeuponOutcome variable
head(train$AgeuponOutcome)[1] "1 year" "1 year" "2 years" "3 weeks" "2 years" "1 month"
train$AgeuponOutcome <- gsub('s', '', train$AgeuponOutcome)table(train$AgeuponOutcome)
0 year 1 day 1 month 1 week 1 year 10 month 10 year 11 month
22 66 1281 317 3969 457 446 166
11 year 12 year 13 year 14 year 15 year 16 year 17 year 18 year
126 234 143 97 85 36 17 10
19 year 2 day 2 month 2 week 2 year 20 year 3 day 3 month
3 99 3397 529 3742 2 109 1277
3 week 3 year 4 day 4 month 4 week 4 year 5 day 5 month
659 1823 50 888 334 1071 24 652
5 week 5 year 6 day 6 month 6 year 7 month 7 year 8 month
11 992 50 588 670 288 531 402
8 year 9 month 9 year Unknown
536 224 288 18
create a numeric age value:
train$Age <- sapply(train$AgeuponOutcome, function(x) strsplit(x, split = ' ')[[1]][1]) %>% as.numeric()
head(train$Age)[1] 1 1 2 3 2 1
correct the various units of time
The vectorized function ifelse() works nicely here
train$Unit <- sapply(train$AgeuponOutcome, function(x) strsplit(x, split = ' ')[[1]][2])
train$AgeInDays <- ifelse(train$Unit == 'day', train$Age,
ifelse(train$Unit == 'week', train$Age*7,
ifelse(train$Unit == 'month', train$Age*30,
ifelse(train$Unit == 'year', train$Age*365, NA))))Always check your work
datatable(train[, c(8, 12:14)])Making more of that DateTime
train$hour <- hour(train$DateTime)
train$DOW <- wday(train$DateTime)
train$month <- month(train$DateTime)
train$year <- year(train$DateTime)Consider squeezing more information out of the data, Time of day can help us figure out when they are open:
plot(table(train$hour), type="h")
train$period <- ifelse(train$hour > 8 & train$hour < 11, 'early',
ifelse(train$hour > 10 & train$hour < 17, 'midday',
ifelse(train$hour > 16 & train$hour < 19, 'endday', 'overnight')))there is a problem with the ordering
ggplot(train, aes(x = period, y = AgeInDays)) + geom_boxplot() + theme_classic()
Use factor levels to order non-numeric data
train$period <- factor(train$period,
levels = c('early', 'midday',
'endday', 'overnight'))check the results
ggplot(train, aes(x = period, y = AgeInDays)) +
geom_boxplot(fill=c("red", "orange","yellow", "green")) +
theme_classic()
Always, Always Check The Data
You should probably run a quick summary or some other function which generates ranges and tabulates NAs to make certain that there are no unusual observations
summary(train) AnimalID Name DateTime
Length:26729 Length:26729 Min. :2013-10-01 09:31:00.00
Class :character Class :character 1st Qu.:2014-05-31 16:31:00.00
Mode :character Mode :character Median :2014-12-13 17:10:00.00
Mean :2014-12-19 00:22:23.96
3rd Qu.:2015-07-19 19:48:00.00
Max. :2016-02-21 19:17:00.00
OutcomeType OutcomeSubtype AnimalType SexuponOutcome
Length:26729 Length:26729 Length:26729 Length:26729
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
AgeuponOutcome Breed Color NoName
Length:26729 Length:26729 Length:26729 Mode :logical
Class :character Class :character Class :character FALSE:19038
Mode :character Mode :character Mode :character TRUE :7691
Age Unit AgeInDays hour
Min. : 0.000 Length:26729 Min. : 0.0 Min. : 0.00
1st Qu.: 2.000 Class :character 1st Qu.: 60.0 1st Qu.:12.00
Median : 2.000 Mode :character Median : 365.0 Median :15.00
Mean : 3.628 Mean : 794.1 Mean :14.45
3rd Qu.: 5.000 3rd Qu.:1095.0 3rd Qu.:17.00
Max. :20.000 Max. :7300.0 Max. :23.00
NA's :18 NA's :18
DOW month year period
Min. :1.000 Min. : 1.000 Min. :2013 early : 1683
1st Qu.:2.000 1st Qu.: 4.000 1st Qu.:2014 midday :15195
Median :4.000 Median : 7.000 Median :2014 endday : 7846
Mean :3.976 Mean : 6.926 Mean :2014 overnight: 2005
3rd Qu.:6.000 3rd Qu.:10.000 3rd Qu.:2015
Max. :7.000 Max. :12.000 Max. :2016
consider building a dplyr solution
library(dplyr)
library(tidyr)
train %>% select(14:18) %>%
summarise_each(funs(min = min(., na.rm = TRUE),
q25 = quantile(.,probs = 0.25, na.rm = TRUE),
median = median(.,na.rm = TRUE),
q75 = quantile(.,probs = 0.75, na.rm = TRUE),
max = max(.,na.rm = TRUE),
mean = mean(.,na.rm = TRUE),
sd = sd(.,na.rm = TRUE))) %>%
gather(stat, val) %>%
separate(stat, into = c("var", "stat"), sep = "_") %>%
spread(stat, val) %>%
select(var, min, q25, median, q75, max, mean, sd) # reorder columns# A tibble: 5 × 8
var min q25 median q75 max mean sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AgeInDays 0 60 365 1095 7300 794. 1083.
2 DOW 1 2 4 6 7 3.98 2.07
3 hour 0 12 15 17 23 14.4 3.34
4 month 1 4 7 10 12 6.93 3.50
5 year 2013 2014 2014 2015 2016 2014. 0.741
Variables of Substance
Knowing the number of unique values can help us determine whether a variable is suitable for analysis. Sometimes, a variable with many unique values is really just an identifier that is more useful for joining (merging) than for analysis. An example would be student id number, social security numbers, cell phone number etc.
You can use functions such as unique() to find out how many unique values a given variable has:
sapply(train, function(x) length(unique(x))) AnimalID Name DateTime OutcomeType OutcomeSubtype
26729 6373 22918 5 17
AnimalType SexuponOutcome AgeuponOutcome Breed Color
2 5 44 1380 366
NoName Age Unit AgeInDays hour
2 22 5 44 20
DOW month year period
7 12 4 4
train %>% summarise_all(funs(n_distinct(.)))# A tibble: 1 × 19
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome
<int> <int> <int> <int> <int> <int> <int>
1 26729 6373 22918 5 17 2 5
# ℹ 12 more variables: AgeuponOutcome <int>, Breed <int>, Color <int>,
# NoName <int>, Age <int>, Unit <int>, AgeInDays <int>, hour <int>,
# DOW <int>, month <int>, year <int>, period <int>
tidyr is useful for this kind of aggressive data restructuring
train %>%
gather(var, value) %>%
distinct() %>%
count(var)# A tibble: 19 × 2
var n
<chr> <int>
1 Age 22
2 AgeInDays 44
3 AgeuponOutcome 44
4 AnimalID 26729
5 AnimalType 2
6 Breed 1380
7 Color 366
8 DOW 7
9 DateTime 22918
10 Name 6373
11 NoName 2
12 OutcomeSubtype 17
13 OutcomeType 5
14 SexuponOutcome 5
15 Unit 5
16 hour 20
17 month 12
18 period 4
19 year 4
Variables with few values can serve as grouping variables (good variables to use for subsetting).
String work
The variables that typically have too many values are strings, but we can be creative. There are 366 colors, too many to view on screen, but we can table and save to a data frame:
colors <- train %>% count(Color)
datatable(colors)colors <- colors %>% mutate(color_rank = dense_rank(desc(n)))
datatable(colors)train2 <- inner_join(train, colors[, c(1,3)])breed
There is a lot of information here, but we can work with it using string functions.
train2$mix <- train2$Breed %>% str_detect(., "/|Mix")
datatable(train2[, c("Breed","mix")])You could probably think of more things we could do with our variables, like popular names, breed information, etc.
- We begin with graphical tools to generate basic plots of the data to illuminate patterns
- We then move into more complex plots to search for deviations from these patterns
Variation
- Variability is interesting and typically one will examine variables with the most variation.
- There is variability in balance
- What is the source(s) of this variation?
One of the most important things to remember is the type of variable – do I have a numerical or categorical and if I have a numerical – is it discrete or continuous?
ggplot(train2, aes(x=AgeInDays)) +
geom_histogram(aes(y=..density..),bins=20, fill="cyan", color="black") +
ggtitle("Age distribution of shelter animals at outcome") +
theme_wsj()
ggplot(train2, aes(x=AgeInDays)) +
geom_histogram(aes(y=..density..),bins=20, fill="cyan", color="black") +
facet_wrap(~OutcomeType) +
ggtitle("Age distribution of shelter animals at outcome") +
theme_tufte()
ggplot(train2, aes(x=AgeInDays,fill=factor(AnimalType))) +
geom_histogram(aes(y=..density..)) +
facet_wrap(~OutcomeType + AnimalType) +
ggtitle("Age distribution of shelter animals at outcome") +
theme_solarized() + theme(legend.position="none")
ggplot(train2, aes(x=AgeInDays,fill=factor(AnimalType))) +
geom_histogram(aes(y=..density..), color = "black") +
facet_wrap(OutcomeType ~ AnimalType, nrow=5) +
ggtitle("Age distribution of shelter animals at outcome") +
theme_pander() + theme(legend.position="none")
Outliers or Unusual Values
We might transform the variable “AgeInDays” to help us better understand it or perhaps make it more presentable.
Usually long right hand tails can use a square root or a log transformation to help normalize them or at least bring the relationship into clearer focus:
train2$AO <- paste(train2$AnimalType, train2$OutcomeType, sep="-")
ggplot(train2, aes(x=AO, y=sqrt(AgeInDays))) +
geom_boxplot(fill=rep(c("red","blue"), each=5)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
So perhaps outliers are a combination of age, animal type and outcome. Old cats are not usually adopted (and old for a cat is about 2 years) and a dog is not “old” unil 6 or 7 years of age.
Working with groups
Tables are a start when working with categorical (non-numeric) variables
table(train2$OutcomeType, train2$SexuponOutcome, train2$AnimalType), , = Cat
Intact Female Intact Male Neutered Male Spayed Female Unknown
Adoption 141 97 2001 2033 0
Died 40 64 9 11 23
Euthanasia 206 231 108 75 90
Return_to_owner 18 10 252 213 7
Transfer 1709 1525 695 680 896
, , = Dog
Intact Female Intact Male Neutered Male Spayed Female Unknown
Adoption 62 61 3221 3153 0
Died 16 15 10 7 2
Euthanasia 195 246 236 157 11
Return_to_owner 283 467 1995 1535 6
Transfer 841 809 1252 956 59
prop.table(table(train2$OutcomeType, train2$SexuponOutcome, train2$AnimalType),c(1,3)), , = Cat
Intact Female Intact Male Neutered Male Spayed Female
Adoption 0.033005618 0.022705993 0.468398876 0.475889513
Died 0.272108844 0.435374150 0.061224490 0.074829932
Euthanasia 0.290140845 0.325352113 0.152112676 0.105633803
Return_to_owner 0.036000000 0.020000000 0.504000000 0.426000000
Transfer 0.310445050 0.277020890 0.126248865 0.123524069
Unknown
Adoption 0.000000000
Died 0.156462585
Euthanasia 0.126760563
Return_to_owner 0.014000000
Transfer 0.162761126
, , = Dog
Intact Female Intact Male Neutered Male Spayed Female
Adoption 0.009542866 0.009388949 0.495767277 0.485300908
Died 0.320000000 0.300000000 0.200000000 0.140000000
Euthanasia 0.230769231 0.291124260 0.279289941 0.185798817
Return_to_owner 0.066028931 0.108959403 0.465468969 0.358142790
Transfer 0.214705131 0.206535614 0.319632372 0.244064335
Unknown
Adoption 0.000000000
Died 0.040000000
Euthanasia 0.013017751
Return_to_owner 0.001399907
Transfer 0.015062548
table(train2$OutcomeType, train2$SexuponOutcome, train2$AnimalType) %>%
prop.table(c(2,3)) %>%
round(3), , = Cat
Intact Female Intact Male Neutered Male Spayed Female Unknown
Adoption 0.067 0.050 0.653 0.675 0.000
Died 0.019 0.033 0.003 0.004 0.023
Euthanasia 0.097 0.120 0.035 0.025 0.089
Return_to_owner 0.009 0.005 0.082 0.071 0.007
Transfer 0.808 0.791 0.227 0.226 0.882
, , = Dog
Intact Female Intact Male Neutered Male Spayed Female Unknown
Adoption 0.044 0.038 0.480 0.543 0.000
Died 0.011 0.009 0.001 0.001 0.026
Euthanasia 0.140 0.154 0.035 0.027 0.141
Return_to_owner 0.203 0.292 0.297 0.264 0.077
Transfer 0.602 0.506 0.186 0.165 0.756
barplot
ggplot(train2, aes(x = SexuponOutcome, y = ..count.., fill = OutcomeType)) +
geom_bar(stat = 'count', position = 'fill', color = 'black') +
facet_wrap(~AnimalType) +
coord_flip() +
labs(y = 'Proportion of Animals',
x = 'Sex',
title = 'Outcomes by Day of Week') +
theme_fivethirtyeight()
heat map
my_palette <- brewer.pal(n = 11, name = "Spectral")
train2 %>%
count(AO, period) %>%
ggplot(mapping = aes(x = AO, y = period)) +
geom_tile(mapping = aes(fill = n), alpha = 0.75) +
scale_fill_gradientn(colors = my_palette) +
labs(x = "Animal Outcome", y = "Time Period", title = "Animal Outcomes by time period") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) 